ABSTRACT

"Outlier studies" identify items for which extreme differences in
performance by contrasting groups occur; these extreme items are the
"outliers" referred to. Review of the studies conducted on tests
receiving major use in higher education reveals that though one cannot
make a priori classifications of outliers with confidence, one can with
reasonable confidence predict the relatively advantaged group for many
verbal items if they subsequently prove to be outliers, as follows:
aesthetic-philosophical, human relations or female oriented content
relatively favors females as opposed to males; practical affairs, science
or male oriented content relatively favors males as opposed to females;
science content relatively favors whites as opposed to blacks. For test
content that varies in the degree of relatedness to minorities, one would
predict a relative advantage for those outliers that are most related to
minorities. The magnitude of the differences found is not large; perhaps
larger differences would be found if classifications other than race and
sex, which are the most common, were used. It has been found that
differences in cultural or national origin produce larger discrepancies
in item difficulty than differences in race or sex of essentially
native American groups.



INTRODUCTION



The test item performances of groups that differ by race, sex or
other socially relevant characteristics have long been of scientific and
practical interest. In recent years such work has frequently focused on
"outliers" in analyzing the performance of such groups on tests. By
"outliers" is meant items for which group differences in performance
markedly exceed those of most items in the test or sub-test in which the
outlying items are imbedded. The differences on which the identification
of outliers has been based have been called "performance differentials"
and "performance discrepancies." Carlton and Marco (1981) have identified
19^ analyses of differences in item performance; Linn & Harnisch (1979) and
Linn, Levine, Hastings, and Wardrop (1981) offer examples of studies of
differential item performance.

Three motives for such studies can be identified: one motive is
to arrive at some explanation of observed group differences in performance;
a second is to identify sources of test score variation that are irrelevant
to the quality for which measurement is needed so they can be eliminated;
a third is to minimize the unfairness of tests to particular groups of
people. Each of these purposes might be served by identifying categories
of item content that are associated with the particularly good or bad
performance of any particular group relative to another. The identification
of item content associated with group performance discrepancies on
particular items might lead to fruitful speculation as to why that
content favors a particular group, perhaps suggesting what, or whether
anything, should be done about it. In particular, if the source of
performance variation by groups is irrelevant to the purpose of measurement,
it can be eliminated. Indeed, to the extent that one can modify the
content of a test without affecting the relevant measurement aspects
or the intellectual task it poses, one can choose content that balances
out the unfair advantage of particular groups (Donlon, 1973; Medley &
Quirk, 1972; Flaugher & Schrader, 1978).

The item categories of interest here are not those by which items
are currently classified [illegible]
differences in the usual test or sub-test types are well known. Knowing
about them has not led to their control, and it is [illegible] and
socially unacceptable to leave it that the performance differences exist
"because of" sex differences or racial differences. It is certainly
scientifically unwarranted to associate the differences observed with
hereditary effects or to assume that they are environmentally imputable
simply because we have discovered no plausible explanation for them. We
therefore turn to the examination of within-test or sub-test differences
in item performance to formulate hypotheses about group differences in
test performance.

The purpose of the present project was to review reports on outliers
and related reports in order to summarize their implications relating
group differences and item content. This review is confined to studies
of tests that receive major use in higher education. The discussion
uses aptitude and achievement as the major category heads. Despite the
fuzziness of this distinction in many contexts, aptitude and achievement
can be clearly enough separated in the studies of interest here. Before
beginning the aptitude section of this report, however, it will be useful
to give a very brief review of the research methods involved. Shepard,
Camilli, and Averill (1981) give more extensive discussion, bibliography
and a study of these methods, which they call "procedures for detecting
test item bias."



METHODOLOGICAL BACKGROUND

Several definitions of outliers were used in the studies reviewed
here; they do not necessarily identify the same items (Stricker, 1981).
These definitions may be differentiated by the type of item statistic
examined to establish the extent of the performance discrepancy, and by
the type of inferential procedure used, if any, to separate outliers from
the rest of the items.

The item difficulty, P, equal to the fraction of people passing the
item or a transformation of that fraction, is the most common statistic
examined to express differential performance. The most commonly used
transformation of P is Delta, which is the probit of P using a central
tendency of 13 and a standard deviation of four. The arcsine transformation
has also been used. A plot of P against Delta produces the familiar
ogive; a plot of P against Arcsin P bears a very high resemblance
to the ogive. A virtue of these transformations is that their use tends
to linearize bivariate plots when the axes of the plots are defined by
the transformed difficulties. Indeed the linear relationship is so good
that Echternacht (1972) proposed setting up confidence bands based on an
assumed joint normal distribution of the Deltas; items falling outside
the bands would be considered outliers.
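
As a rough illustration of the transformations just described, the
following Python sketch converts a proportion passing into a Delta value
and an arcsine value. Two details are assumptions, since the text above
does not specify them: the sign convention in which harder items receive
larger Deltas (Delta = 13 - 4z, with z the normal deviate for P), and the
square-root form of the arcsine transformation. The function names are
illustrative only.

```python
import math
from statistics import NormalDist


def delta(p, center=13.0, spread=4.0):
    """Delta value for a proportion passing p, with 0 < p < 1.

    Assumed sign convention: harder items (smaller p) get larger Deltas.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(p)  # normal deviate corresponding to p
    return center - spread * z


def arcsine(p):
    """Arcsine transformation of p (square-root form assumed)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return math.asin(math.sqrt(p))


# An item passed by half the group sits at the center of the scale.
middle_item = delta(0.5)
# A hard item (20 percent passing) gets a larger Delta than an easy one.
hard_item, easy_item = delta(0.2), delta(0.8)
```

Under the assumed convention a difference of four Delta points
corresponds to one standard deviation on the underlying normal scale.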

Item difficulties are probably used so often because of their close
intuitive relation to the average group performance. In fact, except for
conventional allowances for drop-outs, omits and guessing, the average
test performance is within a linear transformation of the sum of the item
difficulties. Thus, average test performance for a group tends to be
monotonically related to the sum of Ps or their transformations.
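
The relation between average test performance and the sum of the item
difficulties can be checked on a small example. The sketch below assumes
simple number-right scoring with no omits or guessing corrections, and
the response matrix is invented purely for illustration.

```python
def item_difficulties(responses):
    """Proportion passing each item; rows are examinees, columns items."""
    n_people = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_people
            for j in range(n_items)]


def mean_total_score(responses):
    """Average number-right score across examinees."""
    return sum(sum(row) for row in responses) / len(responses)


# Invented data: four examinees, three items (1 = pass, 0 = fail).
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]
p_values = item_difficulties(responses)
# Under number-right scoring the mean total score equals the sum of Ps.
difference = mean_total_score(responses) - sum(p_values)
```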



The item-test score correlation coefficients are sometimes studied
also. Differential performance on these correlations indicates a discrepancy
in the discriminating power of an item in one group as opposed to
another. In a rough sense, comparisons of such correlations index the
extent to which the item (or test) measures the same thing in one group
as it does in another.

Measures of difficulty or of item-test association have been the
elements of many of the comparisons that have led to identification of
items as outliers. In the simplest procedure, one calculates for each
group the statistics on which comparison is to be based and takes their
differences. The items for which the differences are extreme are the
outliers. This procedure could also be accurately described as follows:
Generate a bivariate plot, where items are the points and the axes
are the values of the item statistics computed using the groups being
compared; put a 45° reference line through the origin of the plot;
identify as outliers those items located farthest from the reference
line. A more common procedure is to use, for each group, the deviation of
item statistics from the group mean, in effect passing the 45° line
through the centroid for the plot. This procedure was used by Coffman
(1961), Donlon (1973) and Levin (1970). The procedure of the plot
reflects the variation in performance as referenced by performance on the
rest of the items. Angoff and Ford (1973) further modified the procedure,
suggesting the principal axis of the plot as the reference line, rather
than the 45° line through the origin or centroid, and gave equations
for the line and distances from the line. Most studies have used Angoff
and Ford's approach with item Delta values of the different groups
defining the reference axes. Using the distance from the principal axis
of the item plots will be referred to hereafter as the Delta-plot method.
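
The Delta-plot method can be sketched in a few lines of Python. The
slope and distance formulas used here are the standard major-axis
(principal-axis) equations; the exact notation of Angoff and Ford is not
reproduced in this report, so the code should be read as an illustration
under that assumption. The Delta values are invented.

```python
import math


def delta_plot(x, y):
    """Major axis of the item plot and each item's signed distance from it.

    x and y hold the Delta values of the same items in two groups.
    Returns (slope, intercept, distances).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    if sxy == 0:
        raise ValueError("major axis is undefined when covariance is zero")
    slope = ((syy - sxx) + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    scale = math.sqrt(slope ** 2 + 1)
    # Perpendicular distance; negative when the item lies above the line.
    distances = [(slope * a + intercept - b) / scale for a, b in zip(x, y)]
    return slope, intercept, distances


# Invented Deltas for five items in two groups; the third item is
# noticeably harder for group y than the common trend would suggest.
group_x = [8.0, 10.0, 12.0, 14.0, 16.0]
group_y = [9.0, 11.0, 16.0, 15.0, 17.0]
slope, intercept, distances = delta_plot(group_x, group_y)
most_extreme = max(range(len(distances)), key=lambda i: abs(distances[i]))
```

In practice the distances would be compared with a cut score chosen by
the investigator; the choice of that cut score is a separate matter.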
In using the major axis or the centroid as the reference line,
overall test performance modifies the definition of outliers. Indeed, an
item that plots almost directly on a 45° line through the origin would
be identified as an outlier if the performance of the two groups were
sufficiently different in the rest of the items (as happened to Donlon in
his 1973 study of mathematics items). Some other methods of identifying
outliers also take total-test performance into account in some way. One
of them is to produce matched samples by selecting a sample from the
larger group that has the same total score distribution as that of the
smaller group. Then the 45° line can be used as the reference line
(Cowell, 1968). Item Delta equating (Hecht & Swineford, 1981), which
computes the Deltas on different groups but then adjusts them for population
differences and uses the 45° reference line, was applied by
Conrad & Wallmark (1975). Because in these methods the advantage of a
group on an item is defined relative to the common trend rather than
relative to the raw difference, the term "relative advantage" is used
repeatedly in the paragraphs to follow to emphasize that the differential
item performance is evaluated in the context of the performance on all
items.

In a correlational approach, Stricker (1981) explicitly uses the
total test score to control the interpretation of performance differences.
Stricker gets a partial correlation of item performance with race, controlling
total test scores so that, in a linear system, he indexes the
association of race with item performance for people of the same ability
level. Other methods allow relaxation of the linearity requirement by
controlling on total score using a group by score level contingency table
(Alderman and Holland, 1981). In this approach, which was proposed by
Scheuneman (1979), one hopes that by keeping score intervals narrow
enough, group differences in total score within the intervals will be
minimal, and in the absence of a group effect the proportion passing would
be about the same. Various indices of the success of this hypothesis
have been proposed.
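
A first-order partial correlation of the kind described for the
correlational approach can be sketched as follows. This is the textbook
formula, not necessarily the exact index Stricker computed, and the data
and the 0/1 group coding are invented for illustration. The sign of the
result depends on how the groups are coded.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(item, group, total):
    """Correlation of item score with group, controlling total score."""
    r_ig = pearson(item, group)
    r_it = pearson(item, total)
    r_gt = pearson(group, total)
    return (r_ig - r_it * r_gt) / math.sqrt((1 - r_it ** 2) * (1 - r_gt ** 2))


# Invented data: eight examinees, two groups with identical total scores.
item = [0, 0, 1, 1, 0, 1, 1, 1]    # 1 = passed the item
group = [0, 0, 0, 0, 1, 1, 1, 1]   # arbitrary 0/1 group code
total = [1, 2, 3, 4, 1, 2, 3, 4]   # total test score
r_partial = partial_correlation(item, group, total)
```

Because the two invented groups have the same total score distribution,
the partial correlation here differs from the simple item-group
correlation only through the item-total relationship.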


Rather than use the total score, Wightman (1979) and Marco (Lord,
1977) have used the ability inferred in an item response theoretical
approach. They constructed separate item response curves for the
groups, and then conducted significance tests to see if the parameters
were the same within statistical variation. Content interpretations by
these authors are not available; Lord (1977) comments that interpretation
of Marco's results has not been successful. Stricker (1981) also used
item response theory and obtained some relationship with context. He did
not, however, use the item response curve for interpretation of results.
Such results are not, therefore, available for this paper.

The paragraphs above attempt to indicate the various methods of
identifying outliers in a way that maintains some rough conceptual
relationship among them. The methods are not essentially interchangeable
though they are undeniably related. Nor is any great preference for a
particular method implied. Rather, the methods have their particular
advantages and disadvantages that have yet to be sorted out. That
sorting exceeds the scope of this paper.



One common feature of the methods is the use of some numerical
standard by which the degree of departure from a common standard is
measured. For some methods the standard is the distance from a reference
line, for others it may be a chi-square value or a significance level.
It is computed for all items in the test or sub-test under study and
can be used with a cut score that separates outliers from the rest.
For most of the studies referenced in this report, the choice of cut
scores is based on one of two types of rationale: statistical significance
or the number of items identified as outliers. Where the sample sizes
are large, many outliers are found because of the great power of significance
tests conducted with many replications. Indeed Donlon (1973) and
Stricker (1981) both used significance tests with large samples and found
many outliers. We would expect Marco (Lord, 1977) and Wightman (1979)
also to find many outliers. In such cases the cut score might be very
close to the average value of the numerical standard for all the items on
the test. This result merely stresses the fact that evaluation of the
social significance of group discrepancies in performance requires more
information than the significance test provides. Angoff (1981) has
concluded that bases other than traditional significance tests should be
sought to evaluate differential performance in outlier studies.
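
The dependence of significance on sample size can be made concrete with
a small calculation. The sketch below uses an ordinary pooled
two-proportion z test, which is an assumption chosen for simplicity and
not the test used in any particular study cited here; the proportions
and sample sizes are invented.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled z statistic for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# The same two-point difference in proportion passing an item,
# evaluated with 200 and then with 20,000 examinees per group.
z_small = two_proportion_z(0.52, 200, 0.50, 200)      # roughly 0.4
z_large = two_proportion_z(0.52, 20000, 0.50, 20000)  # roughly 4.0
```

The difference itself is unchanged; only the sample size moves the
statistic past the conventional 1.96 cut score, which is why a
significance test alone says little about the practical importance of a
discrepancy.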

APTITUDE

Many partitionings of candidate populations are possible, of course,
but the usual context of outlier studies is that of bias in testing.
Hence we have results only for sex and race groupings. When the two
principal types of academic aptitude variable, verbal and mathematics,
are cross classified with the two groupings, four combinations result. These
combinations give the groups of aptitude studies that are covered.

Verbal-Sex

Male/female differences in test performance have long been recorded
in the differential psychology literature. Traditionally women are
regarded as more verbal, men as more quantitative. However, the differences
are not uniform across items. Coffman (1961), based on an examination of
extreme PSAT item differences, hypothesized that items related to "people"
might favor women relatively more; items related to "things" might tend
to favor men. He speculated that the test could be manipulated to
produce relatively better performance for either sex. Donlon (1973)
noted that the advantage of females in verbal performance seems to have
disappeared. He conducted an analysis like that of Coffman using data
for a 1964 SAT administration; he examined the items with
the most extreme differential performance and felt that the Coffman
surmise was supported. Straussberg-Rosenberg and Donlon (1975), using
the Delta-plot method, obtained results in line with those of Coffman
(1961) and Donlon (1973) in that the deviant items that favored males
tended to be "thing"-related, and those that favored females tended to be
"person"-related. Straussberg-Rosenberg and Donlon further elaborated
the "people"-"thing" principle by identifying the test developer categories
of "world of practical affairs" and "science" with "things," and "aesthetic-
philosophical" and "human relationships" with "people." The definitions
of these categories are given below:

A. Aesthetics-Philosophy items deal with art, architecture,
   literature, drama, music, religion, philosophy;

B. World of Practical Affairs items deal with economics, government,
   history, politics, transportation, communication, sports;

C. Science items deal with research, mathematics, agriculture,
   engineering, medicine, weather, manual arts, inventions,
   geography, psychology;

D. Human relationships items deal with interpersonal relationships,
   character analysis, emotions, family.

Items classified according to these test development categories tend to
deviate from the major axis of the Delta plot in the expected direction.
Reference to male-female characters seemed not to favor either sex.

Using SAT items in another analysis like that of Straussberg-
Rosenberg and Donlon, Stern (1978) supported the finding that aesthetic-
philosophical and human relationship items favored white females,
relatively, while science and practical affairs items relatively favored
both black and white males.

Donlon et al. (1980) studied sex differences on the Graduate
Record Examinations Aptitude Test (GRE-AT). This study also used the
Delta-plot method. The results were generally supportive of the previously
mentioned hypothesis for science, practical affairs, and human relationships,
though one item, dealing with practical affairs, ran contrary to the
hypothesis. Conrad and Wallmark (1975) also found a slight tendency
for items on science reading passages on GRE-AT to give relative favor
to males.

Contrasting with the above results are those of Cowell and Swineford
(1972). Using the Delta comparison method to study sex differences on
Law School Admission Test (LSAT) items, they found essentially no
instance of interpretable item bias. Rather than using some kind of
significance test, they came to this conclusion by examining scatter
plots of item difficulties and noticing that the points were very tightly
distributed around the major axes. They pointed to an aspect of their
data that applied to those of other testing programs as well: that
cross-sex correlations of item difficulties tend to be in the very high
nineties. Thus, departures of Delta from the major axis are actually
very small in numerical magnitude. Francesco (1975), in an analysis of
variance, found that from 80 to over 90 percent of variation in transformed
item difficulties was due to items alone; item by sex interactions,
though significant, accounted for only one to three percent. Francesco
implies, however, that a more sensitive examination of differences would
have led Cowell and Swineford to a different conclusion. In particular,
Francesco felt that the use of significance tests rather than examination
of the plots would have identified outliers.



That sex differences in LSAT item performance are related to item
characteristics was demonstrated in the unpublished study by
Francesco (1975), who correlated sex differences in difficulty with a
number of rated characteristics of items. Few of her correlations were
significant, but the traditional finding that math favored males and
verbal favored females was observed, and the inclusion of content on
business, work, and money favored males.

Stricker (1981) also examined GRE-AT and, with a somewhat different
significance test, found very few items that differentially favored males
or females in terms of item difficulty. He supplemented the test
development classifications with two categories for explicit reference
to blacks and females, but found no significant difference. He did,
however, find significant content differences on the partial correlation
of items with sex, controlling on total score. In these correlations
the aesthetic-philosophical, human relationship, and female reference
partial correlations tended to be positive, while those on the world
of practical affairs and science tended to be negative. Positive partial
correlations indicated a favoring of females, hence these results seem
to support the previous ones, although counter examples occurred with
some frequency in Stricker's data.

Sinnott (1980) analyzed sex differences in item performance on
Graduate Management Aptitude Tests (GMAT). In her study the females
out-performed the males on verbal materials. Her finding on national
origin is mentioned in the achievement section of this report under
national origin.

Verbal-Race 

As with male-female differences, race differences in verbal performance
on tests have long been known. Using analysis of variance at the
item level, Cardell and Coffman (1974) noted item by race interactions,
as did Cleary and Hilton (1968). Angoff and Ford (1973) used the Delta
method, which was also used by Cowell (1968) in an early unpublished
study of the February 1963 Admission Test for Graduate School of Business
(now GMAT). Cowell found a number of items on which performance was
relatively discrepant, but no interpretation of these differences was
offered.

A series of analyses of item performance discrepancies by race has
been conducted using the Delta plot method with Scholastic Aptitude Test
items. As regards one of these analyses, Stern's (1974) only finding
on content was that two items that favored whites were based on science
reading passages. No verbal content effect was reported in a later
analysis (Stern, 1975), but Cook and Stern (1975) report that narrative
and "minority relevant" reading passages were associated with reading
comprehension items that were relatively easier for black candidates.
Stern, in 1978, found a minority relevant passage easier for black males,
also finding that an argumentative reading passage favored blacks and a
humanities passage was relatively difficult for black females. Blew
and Ishizuka (1978) and Blew and Stern (1979) reported additional
analyses of SAT items but no content differences were noted. Examination
of all these studies reveals that (1) when a science reading comprehension
outlier is found it is without exception relatively more difficult
for blacks, (2) when a social studies reading outlier is found it is
without exception relatively easier for blacks, (3) average deviations
from the Delta plots are consistent with the outlier findings (1) and (2),
and (4) "minority relevant" items are found only in humanities and social
studies items.

The finding that items based on passages judged minority-relevant
were relatively easier for blacks was repeated by Conrad and Wallmark
(1975) using the GRE-AT.

Stricker (1981), using his partial correlation of race with item
performance controlled on total score, found that for males, aesthetic-
philosophical and human relationship content favored blacks relatively;
world of practical affairs and science favored whites relatively. In
contrast with Cook and Stern's (1975) results, Stricker's data do not
yield significant partial correlations associated with
"black content."

Math-Sex

As with SAT-verbal items, Donlon (1973) studied the difficulty
differences of SAT math items for males and females. He found that, in
the test examined, the items that seemed purely algebraic, as opposed to
those with some story or real world content, gave males less of an
advantage than did those with verbal content. Donlon, Ekstrom and
Lockheed (1979) have found that verbal content tends to be masculine
oriented, and that masculine orientation is related to relatively better
performance by males. Therefore the relative advantage of females on the
purely algebraic material might not be due to the nature of the mathematical
processes involved but to the sex orientation of the language. Indeed,
to give an exception that supports the rule, items about shopping and the
laundry, which are thought to be topics more closely related to women,
relatively favored females.

Straussberg-Rosenberg and Donlon (1975) analyzed the SAT using the
Delta plot method, and found that geometry and arithmetic items were
relatively easier for males; algebra, elementary number theory and
letter addition (filling in missing digits in multiple-addend addition
problems) were relatively easier for females. They also found that real
world reference items tended relatively to favor males.

In contrast to the SAT results by Donlon and others, those by
Stern (1978) do not reveal a consistent outlier difference in item
performance by sex, nor do those of Conrad and Wallmark (1975) and
Stricker (1981). Indeed in the report by Donlon et al. (1980) of GRE-AT
math items, the content interpretation of outliers is not evident.

Sinnott (1980) found some significantly discrepant performances
in the GMAT. She reported a tendency for word problems in problem
solving items relatively to favor men. One nonword problem that favored
men was a geometry item.

Math-Race

As with verbal content, race differences have long been a known
aspect of mathematical test item performance, but few content oriented
factors that contribute to these differences have been reported. In an
early study of the ATGSB, Cowell found a tendency for items involving
percentages to be relatively more difficult for blacks, a finding
that was not repeated in Sinnott's (1980) study of the GMAT. Neither
was it repeated in an analysis conducted for CEEB in which Braswell
(Cook & Stern, 1975) examined items identified as producing discrepant
performance levels between blacks and whites. However, Braswell did
note that four of six outliers dealt with ratios. In a subsequent
CEEB analysis (Stern, 1978) four out of five items that were relatively
harder for black females than white females dealt with decimals or
fractions, though Stern did not list this trend as a finding of the
study. No other content inferences were given in this series.

In an analysis of data for a GRE-AT administration, Conrad and
Wallmark (1975) found that blacks have relatively more difficulty with
items that required the structuring of solutions to problems by translating
words to algebraic expressions; items with relatively less difficulty
for blacks required more straightforward application of quantitative
processing.



ACHIEVEMENT



Though many types of academic achievement tests exist, the number
that have been the subject of outlier study is few. Results are [illegible]
in verbal and English achievement for various language backgrounds.
Though there is a large literature on sex differences in [illegible]
[illegible] on sex differences [illegible]
the physical [illegible] area (Lockheed, 1982). Finally, [illegible]
where the achievement test items performance [illegible]

Language of National Origin

Differential item performance that is associated with the language
of national origin is of particular interest in the Test of English as
a Foreign Language (TOEFL), in part because studies contribute evidence
of construct validity of the test, which is used in foreign student
admissions. Correlations between item difficulties suggest that differences
across languages exceed those found across sex or race in the same
language. It will be remembered that across the groups examined for
verbal and math aptitude in English, the vocabulary item Delta correlations
were in the 90's. In contrast, Angoff and Sharon (1974) found a correlation
of .73 for Deltas of Spanish examinees, and of .88 for Deltas of Gujarati
examinees, against Deltas of a sample from the general TOEFL population,
which is a mixture of these and other language groups. Comparable
correlations for German, Arabic, Chinese and Japanese test-takers were
intermediate to the Spanish and Gujarati values. Also, Alderman and Holland
(1981) studied TOEFL item difficulty intercorrelations for six language
groups: Germanic, Spanish, African, Chinese, Japanese and Arabic. These
correlations were reported by section for two administrations and range
from .43 to .91 with a median value of .80. Alderman and Holland (1981)
also drew two Chinese samples for each administration; the correlations
of Deltas for both pairs of samples did not drop below .99 for any item
type. Clearly, discrepant performances are induced by differences in
language of national origin.

Alderman and Holland (1981) attempted to discover a linguistic
explanation of the discrepant performances by asking specialists in
English as a second language to examine the items, distractors and item
statistics from one administration, then to formulate some linguistic
explanation of the differences, and finally to apply the principles in an
attempt to predict those items that would produce discrepant performances
in a second administration. Unfortunately, there are no principles to
report because the attempt was quite unsuccessful.

Sinnott (1980) studied discrepant item performances for foreign
GMAT examinees who claimed fluency in Chinese, English, French, Indo-
Iranian, Japanese or Spanish. She also obtained performances by a sample
of examinees that claimed U.S. citizenship. Sinnott found that one
passage, which was answerable if one were familiar with the times of
Roosevelt's New Deal, was relatively easier for Japanese examinees. She
speculated that the reason that the other groups did not find the New
Deal items especially easy was that they were not as well acquainted with
that period of American history, or that their command of English as a
group was good enough that they needed no supplementary knowledge to read
the paragraph effectively. No trends were uncovered in the section on
practical business judgment. In the English usage section Sinnott found
items that were relatively easy for foreign examinees and that tested
basic principles of language, principles whose usage is prone to improvement
by drill. Indeed, the foreign examinees did as well as or better than the
U.S. candidates on some of these items. Sinnott's discussion suggests
that foreign candidates deal with the English language in a formal
way, and are not affected by awkward but correct phrasing.

The use of GMAT allowed Si nnatt to present the only comparison of 
... . • / , 

foreign population on math item difficulties. Her principle finding was 

that if the item dealt with real world concepts expressed through language ^ 

the foreign candidates had more difficulty than with symbolically expressed 

problems. 


Achievement-Sex

It has already been pointed out that when items are found on which
discrepant performances by the sexes occur, science content is often
invoked. This finding could be sharpened by a study of outliers on
physics tests. Such a study was conducted by Wheeler and Harris (1981),
in which data from a CEEB physics achievement test were analyzed. The
test covers mechanics, electricity and magnetism, optics and waves, heat
and kinetic theory, and modern physics. Males out-performed females on
the test, though the authors show that the performance is related to the
amount of preparation in physics, which favors males. Unfortunately,
adjustments for math aptitude were not available to the authors because
of limited funds. Delta plot analyses failed to detect any striking
tendencies for particular physics content areas to produce outliers;
except for modern physics and electricity and magnetism, which favored
males, the content categories produced no outliers or produced them for
both sexes. Modern physics produced only one outlier, and electricity
and magnetism produced three. The Wheeler and Harris (1981) study is the
only achievement test study of sex differences that was found.

Achievement-Race

The items of the Common Examination part of the National Teacher's
Examination (NTE) have been analyzed in a study by Levin (1970), who
noted that NTE Commons examination developers had made four assumptions
about test performance on literature items by blacks and whites:

"1. Questions on black literature, questions dealing with black
    artists and musicians, and questions dealing with the black
    experience will be easier for black students.
 2. Questions calling for the analysis of given stimulus materials
    and questions calling for an understanding of material actually
    presented to the student, will be easier for black students
    since their preparation can be assumed to be less thorough in
    factual material than that of his white counterpart.
 3. Questions that rely heavily on factual recall and that deal with
    earlier literary materials (Chaucer, Emerson, Donne, Greek
    myths, for example) will be more difficult for black students,
    again, because we assume their preparation to be less thorough
    in conventional and traditional literary materials.
 4. Questions dealing with contemporary materials, 'with-it,'
    "relevant" literature, will be relatively easier for black
    students who might be assumed to be particularly aware of
    American social problems and interested in material treating
    such live issues."

Levin noted,

"Based on these beliefs, a group of items about black authors
or black experiences were included in the test. As measured
by the average delta, these items proved easier for the 145
black students from southern Negro colleges than for 200 white
students from southern colleges. Thus this small study bears out
the first assumption above. It was the only one supported by the
analysis, as will be seen. [Levin found that] "items on Shakespeare,
Chaucer, Oedipus, Milton, Arnold, Donne, Keats, Wordsworth, even
The Rivals, and questions in chronology and genre proved relatively
less difficult for black students. In American literature these
students compared favorably on questions dealing with such figures
as Emerson and Thoreau. Thus statistics on these items suggest
that black students receive a more conventional and traditional
literary education. Literary 'Giants,' and major documents and
movements would seem to be emphasized and literary history must
have a significant place in their training. Items which prove
relatively more difficult dealt with Dylan Thomas, Browning, Byron,
Lamb, Shaw, Tennyson, Williams, Wilder, even T. S. Eliot. The
students proved weak in literary analysis. It may well be that
colleges emphasizing literary history tend to slight close reading,
new criticism, and intensive examination of the literary text.
Certainly, the latter represents a more modern approach to literature
and colleges that train these black students may simply not have
caught up in this sense.

It is also quite possible that test questions demanding
verbal facility, skill with dealing with nuance and drawing
inferences are more difficult because verbal manipulation is the
area where disadvantaged students appear to be weakest. Whatever
the explanation, the item statistics suggest that students from
black colleges are able to perform better when asked to deal with
factual material dependent upon recall if that factual material
centers upon the most traditional elements of literary study.
[Levin noted further that the black students tended to find]
historical questions dealing with structural descriptions especially
difficult, perhaps because the focus of these questions is not in
accord with the more traditional kinds of preparation they have
had. Instruction in the history of the language has become much
more pervasive as required course for teachers in the last 20 years.
The field is understaffed, and stiff competition for available
linguists is still a problem in academia. If black colleges cannot
meet this economic competition, they may not be providing modern
language-linguistic instruction. The generalization regarding
historical questions dealing with structural descriptions seems to
be supported by black students' scores on language structure. In
this section items are couched in the traditional terminology
of grammar and the figures suggest that this material is more
familiar to black students. Composition and rhetoric and to a
lesser extent the structure items require the student to deal with
language in context and to abstract from a verbal situation an
appropriate description or evaluation.

The test includes the small but significant number of items on
contemporary literature. It is interesting to note that there is
no evidence that black students are particularly strong on such
materials as Catcher in the Rye, or Portnoy's Complaint. If they
are 'with it' it is in relation to 'with it' material that speaks
to their condition, not the literature of middle-class hang-ups.
Tolkien, Vonnegut, Tennessee Williams, film directors and Orwell
are relatively more difficult for them."

With this result, three of the four hypotheses are contradicted and the
definition of "black content" was formulated as content that, while
difficult for most examinees, is specifically related to black examinees
as well as being credibly related to the academic field with which the
examination deals.

Humphries (1979) studied the relative performance of blacks versus
the total group of NTE Commons Examinations takers. This examination
has several sections. Items within each of several categories in each
section were classified on judgmental grounds as having highly similar
content. Humphries calculated the mean percents pass for the items
in each content category. When these percents are transformed into
Deltas and examined one notes that science was relatively difficult for
the blacks and social studies relatively easy. Some difficulty in
interpreting this study is typified by a result in the Professional
Education area. One of the contents in Professional Education deals
with history, philosophy and social development relative to education.
This content was relatively most difficult for blacks among
the other contents under Professional Education. However, when the
Professional Education items were separately examined via the Delta plot
method, the outliers were all from other contents! Hence the interpretation
of the results of this study is quite unclear. Even so, the attempt to
formulate and evaluate content categories that may be related to
differential item performance is a highly desirable feature of this study.

DISCUSSION

The existence of a number of outlier studies of tests commonly used
in higher education invites an integration of results; the patterns of
confirmation and contrast may lead to broader conclusions or conjectures
than could be reached by focusing on any one test or test program.
Granted that different methods and authors are involved, still it is
likely that any existing strong basic variance in differential performance
produced by content variation should come through. One looks for patterns
of results that might suggest experimental or survey research to conduct
critical tests of hypotheses based on the conjectures, including research
on the social processes that might be responsible for item performance
differential. One attempts to turn hindsight into foresight for subsequent
forms to emerge from the test development process. Conceivably curriculum
recommendations could result.



The implications to be drawn here must, however, be constrained by
the fact that no content classification has been discovered by which
one can identify an outlier with confidence. Estimates of the actual
magnitude of the effects for any particular classification have been
infrequently made, in part because each classification includes some
non-outliers, which are not examined in outlier studies. With more
extensive studies of the content classifications that have been used
to date one would expect content effects to be significant, but modest,
in size. This expectation exists because the correlations between
Deltas for groups for whom English is essentially the basic language
tend to be in the nineties. These large correlations leave little room
for deviation from the major axis of the Delta plots. But where language
differences are involved the opportunity for differential performance is
much greater. It has been pointed out that lower correlations are obtained
when Deltas for items from various language groups taking the Test of
English as a Foreign Language are compared. The lowest correlation
between Deltas, .60, was observed by Angoff and Modu (1973) who
administered Spanish- and English-language versions of the same items to
Puerto Rican and American students respectively, and then correlated the
resulting item Deltas. Unfortunately, no reliable content effects on the
differential item performance of groups with different language backgrounds
have been catalogued as yet.
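
The relation between the correlation of Deltas and the room left for
outliers can be illustrated with a small sketch. It is not the procedure
of any study cited here. It assumes the conventional Delta scale (mean 13,
standard deviation 4, larger values for harder items), fits the major axis
of the plot, and flags items by a hypothetical cutoff of 1.5 Delta units.
The proportions passing, the cutoff, and the function names are all
invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def to_delta(p_correct):
    """Delta on the conventional scale (an assumption here): mean 13, SD 4,
    with harder items (lower proportion passing) getting larger Deltas."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


def delta_plot(p_group1, p_group2, cutoff=1.5):
    """Return (correlation, signed distances from the major axis, outliers).

    The cutoff of 1.5 is an illustrative assumption, not a value from the text.
    """
    x = [to_delta(p) for p in p_group1]
    y = [to_delta(p) for p in p_group2]
    mx, my = mean(x), mean(y)
    sxx = mean([(a - mx) ** 2 for a in x])
    syy = mean([(b - my) ** 2 for b in y])
    sxy = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    r = sxy / sqrt(sxx * syy)
    # Major (principal) axis of the scatter of item Deltas.
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    # Signed perpendicular distance from the axis. A negative value means the
    # item's Delta for group 2 lies above the axis, i.e. the item is
    # relatively harder for group 2 than the other items would predict.
    dist = [(slope * a + intercept - b) / sqrt(slope ** 2 + 1)
            for a, b in zip(x, y)]
    outliers = [i for i, d in enumerate(dist) if abs(d) > cutoff]
    return r, dist, outliers


# Invented proportions passing for seven items in two groups; item 4 is
# made much harder for group 2 than its neighbours.
p1 = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
p2 = [0.85, 0.75, 0.65, 0.55, 0.10, 0.35, 0.25]
r, dist, outliers = delta_plot(p1, p2)
print(round(r, 2), outliers)
```

In this invented example a single strongly discrepant item is enough to
pull the correlation well below the nineties, which is the sense in which
high correlations leave little room for deviation from the major axis.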

Though one cannot make a priori classifications of outliers
with confidence, one can with reasonable confidence predict the relatively
advantaged group for many items if they subsequently prove to be outliers.
For verbal items with aesthetic-philosophical, human relations or female
oriented content one would predict a relative advantage for females as
opposed to males. For verbal items involving practical affairs, science
or male oriented content one would predict a relative advantage for males
as opposed to females. For verbal items with science content one would
predict a relative advantage for whites as opposed to blacks. For test
content that varies in the degree of relatedness to minorities, one would
predict a relative advantage for those outliers that are most related to
minorities.

These findings do not support the notion that "bias" is what
outlier studies discover. True, certain of the "female-" and "minority-
related" items introduce content that seems clearly extraneous to the
purpose of the test. But other content categories can be regarded
differently. For example, the existence of a relative disadvantage for
blacks on physical science oriented material doesn't necessarily mean
that verbal items should not be administered with physical science
content; too little is known about aptitudes to make that decision,
though excision of the material is one alternative to consider. What the
finding does point up is an educational deficit that might be worth
addressing further. Clearly the findings of Wheeler and Harris (1981) do
not establish that certain facts regarding modern physics and magnetism
should be struck out of a physics achievement examination because females
exhibited a relative disadvantage when asked about them. Rather the
subject is what it is, but there are some educational problems as regards
women.

One must also keep in mind that the content categories refer to
content encountered in test items and should not be generalized beyond
that context. Consider, for example, the finding that human relations
relatively favors women and that practical affairs such as business
relatively favors men. These findings might seem strange to students
of management and business, who learn that interpersonal relations
will be extremely important in their futures and who might conclude
that skill in human relations is essential to much that is practical
in affairs. To them, success in practical affairs entails success in
human relations, contrary to what the test items might seem to imply.
While there is probably a way to rationalize this apparent contradiction,
one might nevertheless learn from it that the translation of "regularities
in the world" to principles concerning the effects of test item content
on differential item performance, as well as in the other direction, is
not straightforward. Theories about what makes tests "biased" are
probably oversimple.

In the paragraphs above it has been noted that the outlier studies,
which originated in the context of bias research, don't necessarily study
bias and may not be able to discover large effects even if those effects
prove reliable. To avoid being overcritical, note that the outlier
studies were reasonable to do and would have yielded a considerable
payoff for a very modest investment if the results had been more substantial
and interpretable. To make progress, though, it seems that all of the
traditional scientific stages are needed: collection of anecdotes,
formulation of hypotheses, survey and experiment, and theory construction
and confirmation. We seem currently to be in the anecdotal stage, having
some difficulty in the formulation of hypotheses. We are in the anecdotal
stage in that we are observing the pertinent events under uncontrolled
conditions afforded by the use of operational tests which are not designed
to test these hypotheses. Outlier items, each with many characteristics,
are identified with the one characteristic that happens to be of interest
in a particular study being selected as the reason for the variation,
i.e., the "cause" is selected and the anecdote is completed. But for the
most part, the enunciation of generalized principles and the evaluation
and modification of those principles through survey and experimentation
has not occurred. Indeed, mechanisms have been applied to the operational
tests to ensure that variation in item characteristics will be limited;
one such mechanism is the test sensitivity review (Hunter & Slaughter,
1980), which is intended to eliminate possible offensive items from the
test. We are not suggesting that offensive items must be introduced into
operational tests; we are suggesting that the screening of items should
be accompanied by an empirical evaluation of its effects. Surely such
screening has a value, indeed is essential in the modern political
climate. But to apply it so rigidly that there is no opportunity to
evaluate the reality of any anticipated consequences to test behavior may
not be the best course, because the study of differential item performance
is worth pursuing for several reasons. First, a continuing effort to
produce better tests is generally regarded as desirable. For example,
where test items handicap a particular group by introducing content that
is essentially irrelevant to the characteristic being measured, that
content should be discovered, modified, eliminated or counterbalanced
with content that works in the opposite direction. Indeed, fairness
demands that extraneous influences be eliminated or balanced. But if no
differential effects of consequence can be discovered and the intuitively
obvious hypotheses fail, the documentation of the research that yielded
the null results moves the burden of proof to the critics of the tests.
Second, the test population can be conveniently partitioned using a
variety of demographic characteristics because there is considerable
demographic variation in the examinee population. Fremer (1981), for
example, has suggested studying differential item performances of rural
and urban examinees, a partitioning for which he has some anecdotal data
and one that is otherwise unexamined. Third, differential item performance
studies have logistic value for developing scientific understanding. The
items vary in many ways that are not described by subtest titles and can
sample a wide variety of cognitive performances. The logistic mechanisms
for doing differential item performance studies are convenient, as
compared with other scientific manipulations, since they could be
introduced as minor manipulations in the context of ongoing testing efforts;
the results, if phrased in suitable terms, could lead to hypotheses,
research and generalizations concerning other types of cognitive
performances.

It should be mentioned that there has been some movement in the
direction of hypothesis development. Levin (1970) has been quoted
extensively to show how she departed from a set of hypotheses and arrived,
through the examination of data, at a better understanding of the test
performances of her NTE examinees. Donlon (1973) paid careful attention
to the conceptual aspect of his work and was able to improve on the
hypotheses formulated by Coffman (1961). Alderman and Holland (1981)
included the use of substantive experts in an attempt to develop
conceptualizations. Francesco (1975) and Stricker (1981) both applied
conceptualizations to all items and evaluated the results empirically.
Scheunemann's (1981) ongoing study involves manipulating items in
accordance with hypothesized characteristics to see if control over
differential performance can be achieved. This latter study moves even
further down the line of scientific development.

While we might not recommend all the details of Francesco's (1975)
methodology, the general approach merits special emphasis. She obtained
ratings of test items using judges who were not aware of differential
item performance data. The items were rated on several characteristics;
correlation and regression analyses used the ratings as independent
variables, with item performance differences between groups taking the
criterion side. Such studies have the following advantages: (1) they
treat the items as having the characteristics in degree rather than as
all or none phenomena, (2) all of the items contribute to determining the
relationship between the characteristics and differential item performance,
(3) the study creates defined characteristics that can be carried over
from study to study rather than formulating them ad hoc, (4) the study
anticipates future application by defining the judgments to be made, and
(5) one set of data can, rather conveniently, be the subject of several
rating studies that can lead to improved rating criteria.
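
The general shape of such a rating study can be sketched as follows.
This is only an illustration of the approach described above, not a
reconstruction of Francesco's analysis: the two judges, the single rated
characteristic, the five items and the Delta differences are all
hypothetical, and a simple correlation and one-predictor regression stand
in for whatever analyses were actually run.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the regression of ys (criterion) on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical ratings (1 to 5) of one characteristic for five items, made
# by two judges who have not seen the item performance data.
judge_a = [1, 2, 2, 4, 5]
judge_b = [1, 2, 4, 4, 5]
ratings = [(a + b) / 2 for a, b in zip(judge_a, judge_b)]

# Hypothetical criterion: difference in item Delta between two groups.
delta_diff = [0.1, 0.3, 0.2, 0.6, 0.8]

r = pearson(ratings, delta_diff)
slope, intercept = least_squares(ratings, delta_diff)
print(round(r, 3), round(slope, 3), round(intercept, 3))
```

Every item enters the analysis with a graded rating, so the relationship
is estimated from the whole test rather than from the outliers alone.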

Methodologies other than those on which the outlier studies are
based are also needed, especially if one raises the possibility that
the groups on which data have commonly been conditioned in the outlier
studies are not most likely to be those whose use would lead to control of
differential performance. Perhaps it would be more effective to discover
similar performing groups using only response data. Years ago Lazarsfeld
confronted the problem of inferring a typology of examinees based on
dichotomous data, which is indeed another way to approach the problem of
differential item performance. As opposed to the recent outlier studies,
which ask "Are there items that differ for certain groups in terms of
their difficulties?", studies based on Lazarsfeld's latent class model
(Green, 1951) would ask "Are there groups with different vectors of
probabilities of correctly responding to the items?" Actually, if the
Lazarsfeld question is revised to classify as group members those whose
probabilities of correct response are proportional, then the two questions
become different sides of the same polyhedral solid. For unless more
than one substantially sized group of the type Lazarsfeld sought exists,
partitioning subjects to produce item difficulty plots shouldn't yield
meaningful results regardless of the basis of the partitioning. If the
existence of more than one group of appreciable size is indicated, the
nature of the groups would be sought by examining item difficulties for
the groups as well as demographic or other data. Such a study would no
doubt have the not insignificant advantage of demonstrating that even
though sex, ethnic and racial groups are related to the types of groups
the relationship is not perfect, and hence we would be more reluctant to
conclude that ethnic group membership "causes" performance differences.
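
To make the latent class question concrete, the following sketch fits a
two-class model to dichotomous item responses and then assigns each
examinee to the more probable class. It is offered under stated
assumptions only: it uses the EM algorithm, a modern fitting method that
is not the estimation procedure of Lazarsfeld or Green, and the response
patterns, the number of classes, the starting values and the function
name are invented for illustration.

```python
import random
from math import prod


def fit_latent_classes(responses, n_classes=2, n_iter=200, seed=0):
    """Fit a latent class model to 0/1 responses by EM (illustrative sketch).

    Returns class weights, per-class item probabilities, and each
    examinee's posterior class probabilities.
    """
    rng = random.Random(seed)
    n_items = len(responses[0])
    weights = [1.0 / n_classes] * n_classes
    probs = [[rng.uniform(0.3, 0.7) for _ in range(n_items)]
             for _ in range(n_classes)]
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each examinee.
        post = []
        for resp in responses:
            like = [weights[c] * prod(probs[c][j] if u else 1.0 - probs[c][j]
                                      for j, u in enumerate(resp))
                    for c in range(n_classes)]
            total = sum(like)
            post.append([v / total for v in like])
        # M-step: re-estimate class sizes and item probabilities.
        for c in range(n_classes):
            mass = sum(p[c] for p in post)
            weights[c] = mass / len(responses)
            for j in range(n_items):
                est = sum(p[c] * resp[j]
                          for p, resp in zip(post, responses)) / mass
                # Keep probabilities away from 0 and 1 so likelihoods stay positive.
                probs[c][j] = min(max(est, 1e-6), 1.0 - 1e-6)
    return weights, probs, post


# Invented data: ten examinees who mostly pass items 0-2 and fail items 3-5,
# followed by ten examinees with the opposite pattern.
responses = ([[1, 1, 1, 0, 0, 0]] * 8 + [[1, 1, 0, 0, 0, 0]] * 2 +
             [[0, 0, 0, 1, 1, 1]] * 8 + [[0, 0, 1, 1, 1, 1]] * 2)
weights, probs, post = fit_latent_classes(responses)
labels = [max(range(len(p)), key=lambda c: p[c]) for p in post]
print([round(w, 2) for w in weights], labels)
```

Once classes have been recovered from the response data alone, their
membership could be compared with sex, ethnic or other demographic
groupings, which is the comparison the paragraph above has in mind.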

Another side of the polyhedral solid is occupied by Donlon (1968)
who has suggested that individuals' responses could be correlated
with item difficulties to obtain a "personal biserial" correlation
coefficient. People who get easy items wrong and hard items right to
an appreciable extent differ from the majority; if those with low
personal biserials made, among themselves, similar responses they would
constitute a group in the Lazarsfeld sense. Thus, as with the latent
structure model, a product of the analysis could be the identification
of groups with members who respond similarly. It should be mentioned
that Harnisch and Linn (1981) have discussed a whole family of indices
that are related to the personal biserial; Tatsuoka and Linn (1981) have
discussed other approaches including that of item response theory.
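
A rough sketch of the idea follows. The exact coefficient Donlon used is
not given here, so the code substitutes an ordinary Pearson (point-
biserial) correlation between each examinee's right/wrong responses and
the proportion of the group passing each item; that substitution, the
function names and the small response matrix are all assumptions made for
illustration.

```python
from math import sqrt


def item_easiness(responses):
    """Proportion of examinees passing each item."""
    n = len(responses)
    return [sum(person[j] for person in responses) / n
            for j in range(len(responses[0]))]


def personal_correlation(person, easiness):
    """Pearson correlation between one examinee's 0/1 responses and item
    easiness, used here as a stand-in for the personal biserial.
    Returns None when the correlation is undefined (no variation)."""
    k = len(person)
    mu, mp = sum(person) / k, sum(easiness) / k
    sxy = sum((u - mu) * (p - mp) for u, p in zip(person, easiness))
    sxx = sum((u - mu) ** 2 for u in person)
    syy = sum((p - mp) ** 2 for p in easiness)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


# Invented responses of four examinees to five items. The last examinee
# fails the easier items and passes the harder ones.
responses = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
easiness = item_easiness(responses)
scores = [personal_correlation(person, easiness) for person in responses]
print([None if s is None else round(s, 2) for s in scores])
```

Examinees with low or negative values are the ones who depart from the
majority ordering of item difficulty; whether they resemble one another
would still have to be checked before treating them as a group.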